Abstract

This work establishes a new upper bound on the number of samples sufficient for PAC 
learning in the realizable case. The bound matches known lower bounds up to numerical 
constant factors. This solves a long-standing open problem on the sample complexity of 
PAC learning. The technique and analysis build on a recent breakthrough by Hans Simon. 


1. Introduction 


Probably approximately correct learning (or PAC learning; Valiant, 1984) is a classic
criterion for supervised learning, which has been the focus of much research in the past three
decades. The objective in PAC learning is to produce a classifier that, with probability at
least $1-\delta$, has error rate at most $\varepsilon$. To qualify as a PAC learning algorithm,
it must satisfy this guarantee for all possible target concepts in a given family, under all
possible data distributions. To achieve this objective, the learning algorithm is supplied with
a number $m$ of i.i.d. training samples (data points), along with the corresponding correct
classifications. One of the central questions in the study of PAC learning is determining the
minimum number $\mathcal{M}(\varepsilon,\delta)$ of training samples necessary and sufficient
such that there exists a PAC learning algorithm requiring at most
$\mathcal{M}(\varepsilon,\delta)$ samples (for any given $\varepsilon$ and $\delta$). This
quantity is known as the sample complexity.

Determining the sample complexity of PAC learning is a long-standing open problem.
There have been upper and lower bounds established for decades, but they differ by a
logarithmic factor. It has been widely believed that this logarithmic factor can be removed
for certain well-designed learning algorithms, and attempting to prove this has been the
subject of much effort. Simon (2015) has very recently made an enormous leap forward
toward resolving this issue. That work proposed an algorithm that classifies points based
on a majority vote among classifiers trained on independent data sets. Simon proves that
this algorithm achieves a sample complexity that reduces the logarithmic factor in the
upper bound down to a very slowly-growing function. However, that work does not
completely resolve the gap, so that determining the optimal sample complexity remains
open.

The present work resolves this problem by completely eliminating the logarithmic factor. 
The algorithm achieving this new bound is also based on a majority vote of classifiers. 
However, unlike Simon’s algorithm, here the voting classifiers are trained on data subsets 
specified by a recursive algorithm, with substantial overlaps among the data subsets the 
classifiers are trained on. 


2. Notation 

We begin by introducing some basic notation essential to the discussion. Fix a nonempty
set $\mathcal{X}$, called the instance space; we suppose $\mathcal{X}$ is equipped with a
$\sigma$-algebra, defining the measurable subsets of $\mathcal{X}$. Also denote
$\mathcal{Y} = \{-1,+1\}$, called the label space. A classifier is any measurable function
$h : \mathcal{X} \to \mathcal{Y}$. Fix a nonempty set $\mathbb{C}$ of classifiers, called the
concept space. To focus the discussion on nontrivial cases,¹ we suppose
$|\mathbb{C}| \geq 3$; other than this, the results in this article will be valid for any
choice of $\mathbb{C}$.

In the learning problem, there is a probability measure $\mathcal{P}$ over $\mathcal{X}$,
called the data distribution, and a sequence $X_1(\mathcal{P}), X_2(\mathcal{P}), \ldots$ of
independent $\mathcal{P}$-distributed random variables, called the unlabeled data; for
$m \in \mathbb{N}$, also define
$X_{1:m}(\mathcal{P}) = \{X_1(\mathcal{P}),\ldots,X_m(\mathcal{P})\}$, and for completeness
denote $X_{1:0}(\mathcal{P}) = \{\}$. There is also a special element of $\mathbb{C}$, denoted
$f^\star$, called the target function. For any sequence $S_x = \{x_1,\ldots,x_k\}$ in
$\mathcal{X}$, denote by
$(S_x, f^\star(S_x)) = \{(x_1,f^\star(x_1)),\ldots,(x_k,f^\star(x_k))\}$. For any probability
measure $P$ over $\mathcal{X}$, and any classifier $h$, denote by
$\mathrm{er}_P(h;f^\star) = P(x : h(x) \neq f^\star(x))$. A learning algorithm $\mathbb{A}$ is
a map,² mapping any sequence $\{(x_1,y_1),\ldots,(x_m,y_m)\}$ in
$\mathcal{X}\times\mathcal{Y}$ (called a data set), of any length
$m \in \mathbb{N}\cup\{0\}$, to a classifier $h : \mathcal{X} \to \mathcal{Y}$ (not
necessarily in $\mathbb{C}$).

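Definition 1 For any $\varepsilon,\delta \in (0,1)$, the sample complexity of
$(\varepsilon,\delta)$-PAC learning, denoted $\mathcal{M}(\varepsilon,\delta)$, is defined as
the smallest $m \in \mathbb{N}\cup\{0\}$ for which there exists a learning algorithm
$\mathbb{A}$ such that, for every possible data distribution $\mathcal{P}$,
$\forall f^\star \in \mathbb{C}$, denoting
$\hat{h} = \mathbb{A}((X_{1:m}(\mathcal{P}), f^\star(X_{1:m}(\mathcal{P}))))$,

$$\mathbb{P}\left(\mathrm{er}_{\mathcal{P}}(\hat{h}; f^\star) \leq \varepsilon\right) \geq 1-\delta.$$

If no such $m$ exists, define $\mathcal{M}(\varepsilon,\delta) = \infty$.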


The sample complexity is our primary object of study in this work. We require a few
additional definitions before proceeding. Throughout, we use a natural extension of set
notation to sequences: for any finite sequences $\{a_i\}_{i=1}^{k}$, $\{b_i\}_{i=1}^{k'}$, we
denote by $\{a_i\}_{i=1}^{k} \cup \{b_i\}_{i=1}^{k'}$ the concatenated sequence
$\{a_1,\ldots,a_k,b_1,\ldots,b_{k'}\}$. For any set $A$, we denote by
$\{a_i\}_{i=1}^{k} \cap A$ the subsequence comprised of all $a_i$ for which $a_i \in A$.
Additionally, we write $b \in \{b_i\}_{i=1}^{k'}$ to indicate $\exists i \leq k'$ s.t.
$b_i = b$, and we write $\{a_i\}_{i=1}^{k} \subseteq \{b_i\}_{i=1}^{k'}$ or
$\{b_i\}_{i=1}^{k'} \supseteq \{a_i\}_{i=1}^{k}$ to express that
$a_j \in \{b_i\}_{i=1}^{k'}$ for every $j \leq k$. We also denote
$|\{a_i\}_{i=1}^{k}| = k$ (the length of the sequence). For any $k \in \mathbb{N}\cup\{0\}$
and any sequence $S = \{(x_1,y_1),\ldots,(x_k,y_k)\}$ of points in
$\mathcal{X}\times\mathcal{Y}$, denote
$\mathbb{C}[S] = \{h \in \mathbb{C} : \forall (x,y) \in S, h(x) = y\}$, referred to as the set
of classifiers consistent with $S$.

Following Vapnik and Chervonenkis (1971), we say a sequence $\{x_1,\ldots,x_k\}$ of points
in $\mathcal{X}$ is shattered by $\mathbb{C}$ if $\forall y_1,\ldots,y_k \in \mathcal{Y}$,
$\exists h \in \mathbb{C}$ such that $\forall i \in \{1,\ldots,k\}$, $h(x_i) = y_i$:
that is, there are $2^k$ distinct classifications of $\{x_1,\ldots,x_k\}$ realized by
classifiers in $\mathbb{C}$. The Vapnik-Chervonenkis dimension (or VC dimension) of
$\mathbb{C}$ is then defined as the largest integer $k$ for which there exists a sequence
$\{x_1,\ldots,x_k\}$ in $\mathcal{X}$ shattered by $\mathbb{C}$; if no such largest $k$
exists, the VC dimension is said to be infinite. We denote by $d$ the VC dimension
of $\mathbb{C}$. This quantity is of fundamental importance in characterizing the sample complexity


1. The sample complexities for $|\mathbb{C}| = 1$ and $|\mathbb{C}| = 2$ are already quite
well understood in the literature, the former having sample complexity 0, and the latter having
sample complexity either 1 or $\Theta\left(\frac{1}{\varepsilon}\ln\frac{1}{\delta}\right)$
(depending on whether the two classifiers are exact complements or not).

2. We also admit randomized algorithms $\mathbb{A}$, where the "internal randomness" of
$\mathbb{A}$ is assumed to be independent of the data. Formally, there is a random variable $R$
independent of $\{X_i(P)\}_{i,P}$ such that the value $\mathbb{A}(S)$ is determined by the
input data $S$ and the value of $R$.
of PAC learning. In particular, it is well known that the sample complexity is finite for
any $\varepsilon,\delta \in (0,1)$ if and only if $d < \infty$ (Vapnik, 1982; Blumer,
Ehrenfeucht, Haussler, and Warmuth, 1989; Ehrenfeucht, Haussler, Kearns, and Valiant, 1989).
For simplicity of notation, for the remainder of this article we suppose $d < \infty$;
furthermore, note that our assumption of $|\mathbb{C}| \geq 3$ implies $d \geq 1$.
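To make the definitions of shattering and VC dimension concrete, the following Python sketch
checks them by brute force. It assumes a finite list of instances and a finite concept space
given as a list of functions with values in {-1, +1}; the names `is_shattered` and
`vc_dimension`, and the toy threshold class, are illustrative assumptions rather than part of
the formal development.

```python
from itertools import combinations


def is_shattered(points, concepts):
    """True if every labeling in {-1,+1}^k of `points` is realized by some concept."""
    realized = {tuple(h(x) for x in points) for h in concepts}
    return len(realized) == 2 ** len(points)


def vc_dimension(instances, concepts):
    """Largest k such that some k-subset of `instances` is shattered (finite case only)."""
    d = 0
    for k in range(1, len(instances) + 1):
        if any(is_shattered(s, concepts) for s in combinations(instances, k)):
            d = k
        else:
            break  # subsets of shattered sets are shattered, so larger k cannot succeed
    return d


# Hypothetical toy class: thresholds x -> +1 if x >= t, else -1, for t in {0,...,4}.
instances = [0, 1, 2, 3]
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(5)]
```

For this toy class every single point is shattered but no pair is, since no threshold can
label a smaller point +1 and a larger point -1.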

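We adopt a common variation on big-O asymptotic notation, used in much of the learning
theory literature. Specifically, for functions $f,g : (0,1)^2 \to [0,\infty)$, we let
$f(\varepsilon,\delta) = O(g(\varepsilon,\delta))$ denote the assertion that
$\exists \varepsilon_0,\delta_0 \in (0,1)$ and $c_0 \in (0,\infty)$ such that,
$\forall \varepsilon \in (0,\varepsilon_0)$, $\forall \delta \in (0,\delta_0)$,
$f(\varepsilon,\delta) \leq c_0 g(\varepsilon,\delta)$; however, we also require that the
values $\varepsilon_0,\delta_0,c_0$ in this definition be numerical constants, meaning that
they are independent of $\mathbb{C}$ and $\mathcal{X}$. For instance, this means $c_0$ cannot
depend on $d$. We equivalently write $f(\varepsilon,\delta) = \Omega(g(\varepsilon,\delta))$
to assert that $g(\varepsilon,\delta) = O(f(\varepsilon,\delta))$. Finally, we write
$f(\varepsilon,\delta) = \Theta(g(\varepsilon,\delta))$ to assert that both
$f(\varepsilon,\delta) = O(g(\varepsilon,\delta))$ and
$f(\varepsilon,\delta) = \Omega(g(\varepsilon,\delta))$ hold. We also sometimes write
$O(g(\varepsilon,\delta))$ in an expression, as a place-holder for some function
$f(\varepsilon,\delta)$ satisfying $f(\varepsilon,\delta) = O(g(\varepsilon,\delta))$: for
instance, the statement $N(\varepsilon,\delta) \leq d + O(\log(1/\delta))$ expresses that
$\exists f(\varepsilon,\delta) = O(\log(1/\delta))$ for which
$N(\varepsilon,\delta) \leq d + f(\varepsilon,\delta)$. Also, for any value $z \geq 0$, define
$\mathrm{Log}(z) = \ln(\max\{z,e\})$ and similarly $\mathrm{Log}_2(z) = \log_2(\max\{z,2\})$.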

As is commonly required in the learning theory literature, we adopt the assumption that
the events appearing in probability claims below are indeed measurable. For our purposes,
this comes into effect only in the application of classic generalization bounds for
sample-consistent classifiers (Lemma 4 below). See Blumer, Ehrenfeucht, Haussler, and Warmuth
(1989) and van der Vaart and Wellner (1996) for discussion of conditions on $\mathbb{C}$
sufficient for this measurability assumption to hold.


3. Background 

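Our objective in this work is to establish sharp sample complexity bounds. As such, we
should first review the known lower bounds on $\mathcal{M}(\varepsilon,\delta)$. A basic lower
bound of $\frac{1-\varepsilon}{\varepsilon}\ln\left(\frac{1}{\delta}\right)$ was established by
Blumer, Ehrenfeucht, Haussler, and Warmuth (1989) for $0 < \varepsilon < 1/2$ and
$0 < \delta < 1$. A second lower bound of $\frac{d-1}{32\varepsilon}$ was supplied by
Ehrenfeucht, Haussler, Kearns, and Valiant (1989), for $0 < \varepsilon \leq 1/8$ and
$0 < \delta \leq 1/100$. Taken together, these results imply that, for any
$\varepsilon \in (0,1/8]$ and $\delta \in (0,1/100]$,

$$\mathcal{M}(\varepsilon,\delta) \geq \max\left\{\frac{d-1}{32\varepsilon},\ \frac{1-\varepsilon}{\varepsilon}\ln\left(\frac{1}{\delta}\right)\right\} = \Omega\left(\frac{1}{\varepsilon}\left(d + \mathrm{Log}\left(\frac{1}{\delta}\right)\right)\right). \qquad (1)$$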


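This lower bound is complemented by classic upper bounds on the sample complexity.
In particular, Vapnik (1982) and Blumer, Ehrenfeucht, Haussler, and Warmuth (1989)
established an upper bound of

$$\mathcal{M}(\varepsilon,\delta) = O\left(\frac{1}{\varepsilon}\left(d\,\mathrm{Log}\left(\frac{1}{\varepsilon}\right) + \mathrm{Log}\left(\frac{1}{\delta}\right)\right)\right). \qquad (2)$$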


They proved that this sample complexity bound is in fact achieved by any algorithm that
returns a classifier $h \in \mathbb{C}[(X_{1:m}(\mathcal{P}), f^\star(X_{1:m}(\mathcal{P})))]$,
also known as a sample-consistent learning algorithm (or empirical risk minimization
algorithm). A sometimes-better upper bound was established by Haussler, Littlestone, and
Warmuth (1994):

$$\mathcal{M}(\varepsilon,\delta) = O\left(\frac{d}{\varepsilon}\mathrm{Log}\left(\frac{1}{\delta}\right)\right). \qquad (3)$$


This bound is achieved by a modified variant of the one-inclusion graph prediction algorithm,
a learning algorithm also proposed by Haussler, Littlestone, and Warmuth (1994), which
has been conjectured to achieve the optimal sample complexity (Warmuth, 2004).

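In very recent work, Simon (2015) produced a breakthrough insight. Specifically, by
analyzing a learning algorithm based on a simple majority vote among classifiers consistent
with distinct subsets of the training data, Simon (2015) established that, for any
$K \in \mathbb{N}$,

$$\mathcal{M}(\varepsilon,\delta) = O\left(\frac{2^{2K}\sqrt{K}}{\varepsilon}\left(d\log^{(K)}\left(\frac{1}{\varepsilon}\right) + K + \mathrm{Log}\left(\frac{1}{\delta}\right)\right)\right), \qquad (4)$$

where $\log^{(K)}(x)$ is the $K$-times iterated logarithm: $\log^{(0)}(x) = \max\{x, 1\}$ and
$\log^{(K)}(x) = \max\{\log_2(\log^{(K-1)}(x)), 1\}$. In particular, one natural choice would
be $K \approx \log^*\left(\frac{1}{\varepsilon}\right)$,³ which (one can show) optimizes the
asymptotic dependence on $\varepsilon$ in the bound, yielding

$$\mathcal{M}(\varepsilon,\delta) = O\left(\frac{1}{\varepsilon}2^{O(\log^*(1/\varepsilon))}\left(d + \mathrm{Log}\left(\frac{1}{\delta}\right)\right)\right).$$

In general, the entire form of the bound (4) is optimized (up to numerical constant factors)
by choosing
$K = \max\left\{\log^*\left(\frac{1}{\varepsilon}\right) - \log^*\left(\frac{1}{d}\mathrm{Log}\left(\frac{1}{\delta}\right)\right) + 1,\ 1\right\}$.
Note that, with either of these choices of $K$, there is a range of $\varepsilon$, $\delta$,
and $d$ values for which the bound (4) is strictly smaller than both (2) and (3): for
instance, for small $\varepsilon$, it suffices to have
$\mathrm{Log}(1/\delta) \ll d\,\mathrm{Log}(1/\varepsilon)/(2^{2\log^*(1/\varepsilon)}\sqrt{\log^*(1/\varepsilon)})$
while $2^{2\log^*(1/\varepsilon)}\sqrt{\log^*(1/\varepsilon)} \ll \min\{\mathrm{Log}(1/\delta), d\}$.
However, this bound still does not quite match the form of the lower bound (1).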
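As a small illustration of how slowly the iterated logarithm grows, the following Python
sketch implements $\log^{(K)}$ and $\log^*$ directly from the definitions used in this section
(with $\log^*(x)$ taken as the smallest $K \in \mathbb{N}\cup\{0\}$ for which
$\log^{(K)}(x) \leq 1$). The function names `iter_log` and `log_star` are our own choices for
this sketch.

```python
import math


def iter_log(x, K):
    """K-times iterated logarithm: log^(0)(x) = max(x, 1) and
    log^(K)(x) = max(log2(log^(K-1)(x)), 1)."""
    v = max(float(x), 1.0)
    for _ in range(K):
        v = max(math.log2(v), 1.0)
    return v


def log_star(x):
    """Smallest K in {0, 1, 2, ...} for which log^(K)(x) <= 1."""
    K = 0
    while iter_log(x, K) > 1.0:
        K += 1
    return K
```

Under these definitions, even an argument as large as $10^{100}$ needs only five iterations
before the iterated logarithm reaches 1.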

There have also been many special-case analyses, studying restricted types of concept
spaces $\mathbb{C}$ for which the above gaps can be closed (e.g., Auer and Ortner, 2007;
Darnstädt, 2015; Hanneke, 2015). However, these special conditions do not include many of the
most-commonly studied concept spaces, such as linear separators and multilayer neural networks.
There have also been a variety of studies that, in addition to restricting to specific concept
spaces $\mathbb{C}$, also introduce strong restrictions on the data distribution
$\mathcal{P}$, and establish an upper bound of the same form as the lower bound (1) under
these restrictions (e.g., Long, 2003; Giné and Koltchinskii, 2006; Bshouty, Li, and Long, 2009;
Hanneke, 2009, 2015; Balcan and Long, 2013). However, there are many interesting classes
$\mathbb{C}$ and distributions $\mathcal{P}$ for which these results do not imply any
improvements over (2). Thus, in the present literature, there persists a gap between the lower
bound (1) and the minimum of all of the known upper bounds (2), (3), and (4) applicable to the
general case of an arbitrary concept space of a given VC dimension $d$ (under arbitrary data
distributions).

In the present work, we establish a new upper bound for a novel learning algorithm,
which holds for any concept space $\mathbb{C}$, and which improves over all of the above
general upper bounds in its joint dependence on $\varepsilon$, $\delta$, and $d$. In
particular, it is optimal, in the


3. The function $\log^*(x)$ is the iterated logarithm: the smallest
$K \in \mathbb{N}\cup\{0\}$ for which $\log^{(K)}(x) \leq 1$. It is an extremely slowly
growing function of $x$.
sense that it matches the lower bound (1) up to numerical constant factors. This work thus
solves a long-standing open problem, by determining the precise form of the optimal sample
complexity, up to numerical constant factors.


4. Main Result 

This section presents the main contributions of this work: a novel learning algorithm, and 
a proof that it achieves the optimal sample complexity. 


4.1 Sketch of the Approach 


The general approach used here builds on an argument of Simon (2015), which itself has
roots in the analysis of sample-consistent learning algorithms by Hanneke (2009, Section
2.9.1). The essential idea from Simon (2015) is that, if we have two classifiers, $h$ and $g$,
the latter of which is an element of $\mathbb{C}$ consistent with an i.i.d. data set $S$
independent from $h$, then we can analyze the probability that they both make a mistake on a
random point by bounding the error rate of $h$ under the distribution $\mathcal{P}$, and
bounding the error rate of $g$ under the conditional distribution given that $h$ makes a
mistake. In particular, it will either be the case that $h$ itself has small error rate, or
else (if $h$ has error rate larger than our desired bound) with high probability, the number of
points in $S$ contained in the error region of $h$ will be at least some number
$\propto \mathrm{er}_{\mathcal{P}}(h;f^\star)|S|$; in the latter case, we can bound the
conditional error rate of $g$ in terms of the number of such points via a classic
generalization bound for sample-consistent classifiers (Lemma 4 below). Multiplying this bound
on the conditional error rate of $g$ by the error rate of $h$ results in a bound on the
probability they both make a mistake. More specifically, this argument yields a bound of the
following form: for an appropriate numerical constant $\tilde{c} \in (0,\infty)$, with
probability at least $1-\delta$, $\forall g \in \mathbb{C}[S]$,

$$\mathcal{P}\left(x : h(x) = g(x) \neq f^\star(x)\right) \leq \frac{\tilde{c}}{|S|}\left(d\,\mathrm{Log}\left(\frac{\mathrm{er}_{\mathcal{P}}(h;f^\star)|S|}{d}\right) + \mathrm{Log}\left(\frac{1}{\delta}\right)\right).$$

The original analysis of Simon (2015) applied this reasoning repeatedly, in an inductive
argument, thereby bounding the probability that $K$ classifiers, each consistent with one of
$K$ independent training sets, all make a mistake on a random point. He then reasoned that
the error rate of the majority vote of $2K-1$ such classifiers can be bounded by the sum of
these bounds for all subsets of $K$ of these classifiers, since the majority vote classifier
agrees with at least $K$ of the constituent classifiers.

In the present work, we also consider a simple majority vote of a number of classifiers,
but we alter the way the data is split up, allowing significant overlaps among the subsamples.
In particular, each classifier is trained on considerably more data this way. We construct
these subsamples recursively, motivated by an inductive analysis of the sample complexity.
At each stage, we have a working set $S$ of i.i.d. data points, and another sequence $T$ of
data points, referred to as the partially-constructed subsample. As a terminal case, if $|S|$
is smaller than a certain cutoff size, we generate a subsample $S \cup T$, on which we will
train a classifier $g \in \mathbb{C}[S \cup T]$. Otherwise (for the nonterminal case), we use
(roughly) a constant fraction of the points in $S$ to form a subsequence $S_0$, and make three
recursive calls to the algorithm, using $S_0$ as the working set in each call. By an inductive
hypothesis, for each of these three recursive calls, with probability $1-\delta'$, the
majority vote of the classifiers trained on subsamples generated by that call has error rate at
most $\frac{c}{|S_0|}\left(d + \mathrm{Log}(1/\delta')\right)$, for an appropriate numerical
constant $c$. These three majority vote classifiers, denoted $h_1$, $h_2$, $h_3$, will each
play the role of $h$ in the argument above.

With the remaining constant fraction of data points in $S$ (i.e., those not used to form
$S_0$), we divide them into three independent subsequences $S_1$, $S_2$, $S_3$. Then for each
of the three recursive calls, we provide as its partially-constructed subsample (i.e., the
"$T$" argument) a sequence $S_i \cup S_j \cup T$ with $i \neq j$; specifically, for the
$k$-th recursive call ($k \in \{1,2,3\}$), we take $\{i,j\} = \{1,2,3\} \setminus \{k\}$.
Since the argument $T$ is retained within the partially-constructed subsample passed to each
recursive call, a simple inductive argument reveals that, for each $i \in \{1,2,3\}$,
$\forall k \in \{1,2,3\} \setminus \{i\}$, all of the classifiers $g$ trained on subsamples
generated in the $k$-th recursive call are contained in $\mathbb{C}[S_i]$. Furthermore, since
$S_i$ is not included in the argument to the $i$-th recursive call, $h_i$ and $S_i$ are
independent. Thus, by the argument discussed above, applied with $h = h_i$ and $S = S_i$, we
have that with probability at least $1-\delta'$, for any $g$ trained on a subsample generated
in recursive calls $k \in \{1,2,3\} \setminus \{i\}$, the probability that both $h_i$ and $g$
make a mistake on a random point is at most
$\frac{\tilde{c}}{|S_i|}\left(d\,\mathrm{Log}\left(\frac{\mathrm{er}_{\mathcal{P}}(h_i;f^\star)|S_i|}{d}\right) + \mathrm{Log}\left(\frac{1}{\delta'}\right)\right)$.
Composing this with the aforementioned inductive hypothesis, recalling that
$|S_i| \propto |S|$ and $|S_0| \propto |S|$, and simplifying by a bit of calculus, this is at
most $\frac{c'}{|S|}\left(d\,\mathrm{Log}(c) + \mathrm{Log}\left(\frac{1}{\delta}\right)\right)$,
for an appropriate numerical constant $c'$. By choosing $\delta' \propto \delta$
appropriately, the union bound implies that, with probability at least $1-\delta$, this holds
for all choices of $i \in \{1,2,3\}$. Furthermore, by choosing $c$ sufficiently large, this
bound is at most $\frac{c}{12|S|}\left(d + \mathrm{Log}\left(\frac{1}{\delta}\right)\right)$.

To complete the inductive argument, we then note that on any point $x$, the majority
vote of all of the classifiers (from all three recursive calls) must agree with at least one
of the three classifiers $h_i$, and must agree with at least 1/4 of the classifiers $g$
trained on subsamples generated in recursive calls $k \in \{1,2,3\} \setminus \{i\}$.
Therefore, on any point $x$ for which the majority vote makes a mistake, with probability at
least 1/12, a uniform random choice of $i \in \{1,2,3\}$, and of $g$ from recursive calls
$k \in \{1,2,3\} \setminus \{i\}$, results in $h_i$ and $g$ that both make a mistake on $x$.
Applying this fact to a random point $X \sim \mathcal{P}$ (and invoking Fubini's theorem),
this implies that the error rate of the majority vote is at most 12 times the average (over
choices of $i$ and $g$) of the probabilities that $h_i$ and $g$ both make a mistake on $X$.
Combined with the above bound, this is at most
$\frac{c}{|S|}\left(d + \mathrm{Log}\left(\frac{1}{\delta}\right)\right)$. The formal details
are provided below.


4.2 Formal Details 


For any $k \in \mathbb{N}\cup\{0\}$, and any $S \in (\mathcal{X}\times\mathcal{Y})^k$ with
$\mathbb{C}[S] \neq \emptyset$, let $L(S)$ denote an arbitrary classifier $h$ in
$\mathbb{C}[S]$, entirely determined by $S$: that is, $L(\cdot)$ is a fixed sample-consistent
learning algorithm (i.e., empirical risk minimizer). For any $k \in \mathbb{N}$ and sequence
of data sets $\{S_1,\ldots,S_k\}$, denote
$L(\{S_1,\ldots,S_k\}) = \{L(S_1),\ldots,L(S_k)\}$. Also, for any values
$y_1,\ldots,y_k \in \mathcal{Y}$, define the majority function:
$\mathrm{Majority}(y_1,\ldots,y_k) = \mathrm{sign}\left(\sum_{i=1}^{k} y_i\right) = 2\,\mathbb{1}\left[\sum_{i=1}^{k} y_i \geq 0\right] - 1$.
We also overload this notation, defining the majority classifier
$\mathrm{Majority}(h_1,\ldots,h_k)(x) = \mathrm{Majority}(h_1(x),\ldots,h_k(x))$, for any
classifiers $h_1,\ldots,h_k$.
Now consider the following recursive algorithm, which takes as input two finite data sets,
$S$ and $T$, satisfying $\mathbb{C}[S \cup T] \neq \emptyset$, and returns a finite sequence of
data sets (referred to as subsamples of $S \cup T$). The classifier used to achieve the new
sample complexity bound below is simply the majority vote of the classifiers obtained by
applying $L$ to these subsamples.

Algorithm: $\mathbb{A}(S;T)$

0. If $|S| \leq 3$

1. Return $\{S \cup T\}$

2. Let $S_0$ denote the first $|S| - 3\lfloor |S|/4 \rfloor$ elements of $S$, $S_1$ the next
$\lfloor |S|/4 \rfloor$ elements, $S_2$ the next $\lfloor |S|/4 \rfloor$ elements, and $S_3$
the remaining $\lfloor |S|/4 \rfloor$ elements after that

3. Return $\mathbb{A}(S_0; S_2 \cup S_3 \cup T) \cup \mathbb{A}(S_0; S_1 \cup S_3 \cup T) \cup \mathbb{A}(S_0; S_1 \cup S_2 \cup T)$
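The recursion above can be sketched in a few lines of Python. This is only an illustrative
sketch under stated assumptions: data sets are Python lists of `(x, y)` pairs, the terminal
case is taken to be $|S| \leq 3$ (consistent with the proof below, where $|S| \geq 4$ leads to
Step 3), and the base learner `consistent_threshold` is a hypothetical sample-consistent
learner for one-dimensional threshold classifiers, chosen purely for the example. All function
names are our own.

```python
def subsamples(S, T=()):
    """Recursive subsample construction A(S; T); S and T are sequences of (x, y) pairs."""
    S, T = list(S), list(T)
    if len(S) <= 3:                      # Steps 0-1: terminal case returns {S u T}
        return [S + T]
    q = len(S) // 4                      # floor(|S| / 4)
    n0 = len(S) - 3 * q                  # |S_0| = |S| - 3 * floor(|S| / 4)
    S0, S1, S2, S3 = S[:n0], S[n0:n0 + q], S[n0 + q:n0 + 2 * q], S[n0 + 2 * q:]
    # Step 3: three recursive calls on S_0, each leaving out one of S_1, S_2, S_3.
    return (subsamples(S0, S2 + S3 + T)
            + subsamples(S0, S1 + S3 + T)
            + subsamples(S0, S1 + S2 + T))


def majority(labels):
    """Majority(y_1, ..., y_k): +1 if the sum is >= 0, else -1 (ties go to +1)."""
    return 1 if sum(labels) >= 0 else -1


def consistent_threshold(sample):
    """Hypothetical base learner L for thresholds x -> +1 iff x >= t:
    picks t as the smallest positively labeled point (t = +inf if there is none)."""
    positives = [x for x, y in sample if y == 1]
    t = min(positives) if positives else float("inf")
    return lambda x: 1 if x >= t else -1


def majority_vote_learner(S):
    """Returns the classifier Majority(L(A(S; empty sequence)))."""
    hs = [consistent_threshold(sub) for sub in subsamples(S)]
    return lambda x: majority([h(x) for h in hs])
```

For example, with 64 points labeled by a threshold at 0.5, `subsamples` returns 27 overlapping
subsamples of 43 points each, and the resulting majority vote agrees with every training label.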

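Theorem 2

$$\mathcal{M}(\varepsilon,\delta) = O\left(\frac{1}{\varepsilon}\left(d + \mathrm{Log}\left(\frac{1}{\delta}\right)\right)\right).$$

In particular, a sample complexity of the form expressed on the right hand side is achieved
by the algorithm that returns the classifier $\mathrm{Majority}(L(\mathbb{A}(S;\emptyset)))$,
given any data set $S$.

Combined with (1), this immediately implies the following corollary.

Corollary 3

$$\mathcal{M}(\varepsilon,\delta) = \Theta\left(\frac{1}{\varepsilon}\left(d + \mathrm{Log}\left(\frac{1}{\delta}\right)\right)\right).$$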

The algorithm $\mathbb{A}$ is expressed above as a recursive method for constructing a
sequence of subsamples, as this form is most suitable for the arguments in the proof below.
However, it should be noted that one can equivalently describe these constructed subsamples
directly, as the selection of which data points should be included in which subsamples can be
expressed as a simple function of the indices. To illustrate this, consider the simplest case
in which $S = \{(x_0,y_0),\ldots,(x_{m-1},y_{m-1})\}$ with $m = 4^{\ell}$ for some
$\ell \in \mathbb{N}$: that is, $|S|$ is a power of 4. In this case, let
$\{T_0,\ldots,T_{n-1}\}$ denote the sequence of labeled data sets returned by
$\mathbb{A}(S;\emptyset)$, and note that since each recursive call reduces $|S|$ by a factor
of 4 while making 3 recursive calls, we have $n = 3^{\ell}$. First, note that $(x_0,y_0)$ is
contained in every subsample $T_j$. For the rest, consider any $i \in \{1,\ldots,m-1\}$ and
$j \in \{0,\ldots,n-1\}$, and let us express $i$ in its base-4 representation as
$i = \sum_{t=0}^{\ell-1} i_t 4^t$, where each $i_t \in \{0,1,2,3\}$, and express $j$ in its
base-3 representation as $j = \sum_{t=0}^{\ell-1} j_t 3^t$, where each $j_t \in \{0,1,2\}$.
Then it holds that $(x_i,y_i) \in T_j$ if and only if the largest
$t \in \{0,\ldots,\ell-1\}$ with $i_t \neq 0$ satisfies $i_t - 1 \neq j_t$. This kind of
direct description of the subsamples is also possible when $|S|$ is not a power of 4, though a
bit more complicated to express.
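The index-based description can be checked mechanically against the recursive construction.
The Python sketch below does so for $m = 4^{\ell}$, running the recursion on the indices
$0,\ldots,m-1$ rather than on labeled points, and assuming the terminal case $|S| \leq 3$. The
helper names (`subsample_indices`, `digits`, `in_subsample`, `check`) are our own and are used
only for this illustration.

```python
def subsample_indices(S, T=()):
    """A(S; T) applied to a list of indices instead of labeled points."""
    S, T = list(S), list(T)
    if len(S) <= 3:
        return [set(S + T)]
    q = len(S) // 4
    n0 = len(S) - 3 * q
    S0, S1, S2, S3 = S[:n0], S[n0:n0 + q], S[n0 + q:n0 + 2 * q], S[n0 + 2 * q:]
    return (subsample_indices(S0, S2 + S3 + T)
            + subsample_indices(S0, S1 + S3 + T)
            + subsample_indices(S0, S1 + S2 + T))


def digits(v, base, length):
    """Digits of v in the given base, least significant first, padded to `length`."""
    return [(v // base ** t) % base for t in range(length)]


def in_subsample(i, j, ell):
    """Direct membership rule for m = 4**ell: is point i contained in subsample T_j?"""
    if i == 0:
        return True                      # (x_0, y_0) is in every subsample
    i_digits, j_digits = digits(i, 4, ell), digits(j, 3, ell)
    t = max(t for t in range(ell) if i_digits[t] != 0)
    return i_digits[t] - 1 != j_digits[t]


def check(ell):
    """Compare the recursive construction with the direct rule for m = 4**ell."""
    m, n = 4 ** ell, 3 ** ell
    subs = subsample_indices(range(m))
    return len(subs) == n and all(
        (i in subs[j]) == in_subsample(i, j, ell)
        for i in range(m) for j in range(n))
```

Calling `check(ell)` for small values of `ell` compares every (point, subsample) pair, so it
is only practical for small $m$, but it gives a quick sanity check of the rule.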


4.3 Proof of Theorem 2


The following classic result will be needed in the proof. A bound of this type is implied
by a theorem of Vapnik (1982); the version stated here features slightly smaller constant
factors, obtained by Blumer, Ehrenfeucht, Haussler, and Warmuth (1989).⁴


4. Specifically, it follows by combining their Theorem A2.1 and Proposition A2.1, setting the
resulting expression equal to $\delta$ and solving for $\varepsilon$.
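Lemma 4 For any $\delta \in (0,1)$, $m \in \mathbb{N}$, $f^\star \in \mathbb{C}$, and any
probability measure $P$ over $\mathcal{X}$, letting $Z_1,\ldots,Z_m$ be independent
$P$-distributed random variables, with probability at least $1-\delta$, every
$h \in \mathbb{C}[\{(Z_i, f^\star(Z_i))\}_{i=1}^{m}]$ satisfies

$$\mathrm{er}_P(h;f^\star) \leq \frac{2}{m}\left(d\,\mathrm{Log}_2\left(\frac{2em}{d}\right) + \mathrm{Log}_2\left(\frac{2}{\delta}\right)\right).$$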

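We are now ready for the proof of Theorem 2.

Proof of Theorem 2 Fix any $f^\star \in \mathbb{C}$ and probability measure $\mathcal{P}$
over $\mathcal{X}$, and for brevity, denote
$\mathbb{S}_{1:m} = (X_{1:m}(\mathcal{P}), f^\star(X_{1:m}(\mathcal{P})))$, for each
$m \in \mathbb{N}$. Also, for any classifier $h$, define
$\mathrm{ER}(h) = \{x \in \mathcal{X} : h(x) \neq f^\star(x)\}$.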

We begin by noting that, for any finite sequences $S$ and $T$ of points in
$\mathcal{X}\times\mathcal{Y}$, a straightforward inductive argument reveals that all of the
subsamples $\hat{S}$ in the sequence returned by $\mathbb{A}(S;T)$ satisfy
$\hat{S} \subseteq S \cup T$ (since no additional data points are ever introduced in any
step). Thus, if $f^\star \in \mathbb{C}[S]$ and $f^\star \in \mathbb{C}[T]$, then
$f^\star \in \mathbb{C}[S] \cap \mathbb{C}[T] = \mathbb{C}[S \cup T] \subseteq \mathbb{C}[\hat{S}]$,
so that $\mathbb{C}[\hat{S}] \neq \emptyset$. In particular, this means that, in this case,
each of these subsamples $\hat{S}$ is a valid input to $L(\cdot)$, and thus
$L(\mathbb{A}(S;T))$ is a well-defined sequence of classifiers. Furthermore, since the
recursive calls all have $T$ as a subsequence of their second arguments, and the terminal case
(i.e., Step 1) includes this second argument in the constructed subsample, another
straightforward inductive argument implies that every subsample $\hat{S}$ returned by
$\mathbb{A}(S;T)$ satisfies $\hat{S} \supseteq T$. Thus, in the case that
$f^\star \in \mathbb{C}[S]$ and $f^\star \in \mathbb{C}[T]$, by definition of $L$, we also
have that every classifier $h$ in the sequence $L(\mathbb{A}(S;T))$ satisfies
$h \in \mathbb{C}[T]$.

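Fix a numerical constant $c = 1800$. We will prove by induction that, for any
$m' \in \mathbb{N}$, for every $\delta' \in (0,1)$, and every finite sequence $T'$ of points
in $\mathcal{X}\times\mathcal{Y}$ with $f^\star \in \mathbb{C}[T']$, with probability at least
$1-\delta'$, the classifier
$\hat{h}_{m',T'} = \mathrm{Majority}(L(\mathbb{A}(\mathbb{S}_{1:m'};T')))$ satisfies

$$\mathrm{er}_{\mathcal{P}}(\hat{h}_{m',T'};f^\star) \leq \frac{c}{m'+1}\left(d + \ln\left(\frac{18}{\delta'}\right)\right). \qquad (5)$$

First, as a base case, consider any $m' \in \mathbb{N}$ with $m' \leq c\ln(18e) - 1$. In this
case, fix any $\delta' \in (0,1)$ and any sequence $T'$ with $f^\star \in \mathbb{C}[T']$.
Also note that $f^\star \in \mathbb{C}[\mathbb{S}_{1:m'}]$. Thus, as discussed above,
$\hat{h}_{m',T'}$ is a well-defined classifier. We then trivially have

$$\mathrm{er}_{\mathcal{P}}(\hat{h}_{m',T'};f^\star) \leq 1 \leq \frac{c}{m'+1}\left(1 + \ln(18)\right) < \frac{c}{m'+1}\left(d + \ln\left(\frac{18}{\delta'}\right)\right),$$

so that (5) holds.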

Now take as an inductive hypothesis that, for some $m \in \mathbb{N}$ with
$m > c\ln(18e) - 1$, for every $m' \in \mathbb{N}$ with $m' < m$, we have that for every
$\delta' \in (0,1)$ and every finite sequence $T'$ in $\mathcal{X}\times\mathcal{Y}$ with
$f^\star \in \mathbb{C}[T']$, with probability at least $1-\delta'$, (5) is satisfied. To
complete the inductive proof, we aim to establish that this remains the case with $m' = m$ as
well. Fix any $\delta \in (0,1)$ and any finite sequence $T$ of points in
$\mathcal{X}\times\mathcal{Y}$ with $f^\star \in \mathbb{C}[T]$. Note that
$c\ln(18e) - 1 \geq 3$, so that (since $|\mathbb{S}_{1:m}| = m > c\ln(18e) - 1$) we have
$|\mathbb{S}_{1:m}| \geq 4$, and hence the execution of $\mathbb{A}(\mathbb{S}_{1:m};T)$
returns in Step 3 (not Step 1). Let $S_0$, $S_1$, $S_2$, $S_3$ be as in the definition of
$\mathbb{A}(S;T)$, with $S = \mathbb{S}_{1:m}$. Also denote $T_1 = S_2 \cup S_3 \cup T$,
$T_2 = S_1 \cup S_3 \cup T$, $T_3 = S_1 \cup S_2 \cup T$, and for each $i \in \{1,2,3\}$,
denote $h_i = \mathrm{Majority}(L(\mathbb{A}(S_0;T_i)))$, corresponding to the majority votes
of classifiers trained on the subsamples from each of the three recursive calls in the
algorithm.

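Note that $S_0 = \mathbb{S}_{1:(m-3\lfloor m/4 \rfloor)}$. Furthermore, since $m \geq 4$, we
have $1 \leq m - 3\lfloor m/4 \rfloor < m$. Also note that $f^\star \in \mathbb{C}[S_i]$ for
each $i \in \{1,2,3\}$, which, together with the fact that $f^\star \in \mathbb{C}[T]$,
implies
$f^\star \in \mathbb{C}[T] \cap \bigcap_{j \in \{1,2,3\}\setminus\{i\}} \mathbb{C}[S_j] = \mathbb{C}[T_i]$
for each $i \in \{1,2,3\}$. Thus, since $f^\star \in \mathbb{C}[S_0]$ as well, for each
$i \in \{1,2,3\}$, $L(\mathbb{A}(S_0;T_i))$ is a well-defined sequence of classifiers (as
discussed above), so that $h_i$ is also well-defined. In particular, note that
$h_i = \hat{h}_{(m-3\lfloor m/4 \rfloor),T_i}$. Therefore, by the inductive hypothesis
(applied under the conditional distribution given $S_1$, $S_2$, $S_3$, which are independent
of $S_0$), combined with the law of total probability, for each $i \in \{1,2,3\}$, there is an
event $E_i$ of probability at least $1-\delta/9$, on which

$$\mathcal{P}(\mathrm{ER}(h_i)) \leq \frac{c}{|S_0|+1}\left(d + \ln\left(\frac{9\cdot 18}{\delta}\right)\right) \leq \frac{4c}{m}\left(d + \ln\left(\frac{9\cdot 18}{\delta}\right)\right). \qquad (6)$$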


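Next, fix any $i \in \{1,2,3\}$, and denote by
$\{(Z_{i,1}, f^\star(Z_{i,1})),\ldots,(Z_{i,N_i}, f^\star(Z_{i,N_i}))\} = S_i \cap (\mathrm{ER}(h_i)\times\mathcal{Y})$,
where $N_i = |S_i \cap (\mathrm{ER}(h_i)\times\mathcal{Y})|$: that is,
$\{(Z_{i,t}, f^\star(Z_{i,t}))\}_{t=1}^{N_i}$ is the subsequence of elements $(x,y)$ in $S_i$
for which $x \in \mathrm{ER}(h_i)$. Note that, since $h_i$ and $S_i$ are independent,
$Z_{i,1},\ldots,Z_{i,N_i}$ are conditionally independent given $h_i$ and $N_i$, each with
conditional distribution $\mathcal{P}(\cdot \mid \mathrm{ER}(h_i))$ (if $N_i > 0$). Thus,
applying Lemma 4 under the conditional distribution given $h_i$ and $N_i$, combined with the
law of total probability, we have that on an event $E_i'$ of probability at least
$1-\delta/9$, if $N_i > 0$, then every
$h \in \mathbb{C}[\{(Z_{i,t}, f^\star(Z_{i,t}))\}_{t=1}^{N_i}]$ satisfies

$$\mathrm{er}_{\mathcal{P}(\cdot \mid \mathrm{ER}(h_i))}(h;f^\star) \leq \frac{2}{N_i}\left(d\,\mathrm{Log}_2\left(\frac{2eN_i}{d}\right) + \mathrm{Log}_2\left(\frac{18}{\delta}\right)\right).$$

Furthermore, as discussed above, each $j \in \{1,2,3\}\setminus\{i\}$ and
$h \in L(\mathbb{A}(S_0;T_j))$ have $h \in \mathbb{C}[T_j]$, and
$T_j \supseteq S_i \supseteq \{(Z_{i,t}, f^\star(Z_{i,t}))\}_{t=1}^{N_i}$, so that
$\mathbb{C}[T_j] \subseteq \mathbb{C}[\{(Z_{i,t}, f^\star(Z_{i,t}))\}_{t=1}^{N_i}]$. It
follows that every $h \in \bigcup_{j \in \{1,2,3\}\setminus\{i\}} L(\mathbb{A}(S_0;T_j))$ has
$h \in \mathbb{C}[\{(Z_{i,t}, f^\star(Z_{i,t}))\}_{t=1}^{N_i}]$. Thus, on the event $E_i'$, if
$N_i > 0$, $\forall h \in \bigcup_{j \in \{1,2,3\}\setminus\{i\}} L(\mathbb{A}(S_0;T_j))$,

$$\begin{aligned}
\mathcal{P}(\mathrm{ER}(h_i) \cap \mathrm{ER}(h)) &= \mathcal{P}(\mathrm{ER}(h_i))\,\mathcal{P}(\mathrm{ER}(h) \mid \mathrm{ER}(h_i)) \\
&= \mathcal{P}(\mathrm{ER}(h_i))\,\mathrm{er}_{\mathcal{P}(\cdot \mid \mathrm{ER}(h_i))}(h;f^\star) \leq \mathcal{P}(\mathrm{ER}(h_i))\,\frac{2}{N_i}\left(d\,\mathrm{Log}_2\left(\frac{2eN_i}{d}\right) + \mathrm{Log}_2\left(\frac{18}{\delta}\right)\right). \qquad (7)
\end{aligned}$$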


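Additionally, since $h_i$ and $S_i$ are independent, by a Chernoff bound (applied under the
conditional distribution given $h_i$) and the law of total probability, there is an event
$E_i''$ of probability at least $1-\delta/9$, on which, if
$\mathcal{P}(\mathrm{ER}(h_i)) \geq \frac{23}{\lfloor m/4 \rfloor}\ln\left(\frac{9}{\delta}\right) \geq \frac{2(10/3)^2}{\lfloor m/4 \rfloor}\ln\left(\frac{9}{\delta}\right)$,
then

$$N_i \geq (7/10)\,\mathcal{P}(\mathrm{ER}(h_i))\,|S_i| = (7/10)\,\mathcal{P}(\mathrm{ER}(h_i))\,\lfloor m/4 \rfloor.$$

In particular, on $E_i''$, if
$\mathcal{P}(\mathrm{ER}(h_i)) \geq \frac{23}{\lfloor m/4 \rfloor}\ln\left(\frac{9}{\delta}\right)$,
then the above inequality implies $N_i > 0$.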

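Combining this with (6) and (7), and noting that
$\mathrm{Log}_2(x) \leq \mathrm{Log}(x)/\ln(2)$ and
$x \mapsto \frac{1}{x}\mathrm{Log}(c'x)$ is nonincreasing on $(0,\infty)$ (for any fixed
$c' > 0$), we have that on $E_i \cap E_i' \cap E_i''$, if
$\mathcal{P}(\mathrm{ER}(h_i)) \geq \frac{23}{\lfloor m/4 \rfloor}\ln\left(\frac{9}{\delta}\right)$,
then every $h \in \bigcup_{j \in \{1,2,3\}\setminus\{i\}} L(\mathbb{A}(S_0;T_j))$ has

\begin{align*}
\mathcal{P}(\mathrm{ER}(h_i) \cap \mathrm{ER}(h))
&\leq \frac{20}{7\lfloor m/4 \rfloor}\left(d\,\mathrm{Log}_2\left(\frac{2e(7/10)\mathcal{P}(\mathrm{ER}(h_i))\lfloor m/4 \rfloor}{d}\right) + \mathrm{Log}_2\left(\frac{18}{\delta}\right)\right) \\
&\leq \frac{20}{7\ln(2)\lfloor m/4 \rfloor}\left(d\,\mathrm{Log}\left(\frac{(7/5)ec\left(d + \ln\left(\frac{9\cdot 18}{\delta}\right)\right)}{d}\right) + \mathrm{Log}\left(\frac{18}{\delta}\right)\right) \\
&\leq \frac{20}{7\ln(2)\lfloor m/4 \rfloor}\left(d\,\mathrm{Log}\left((2/5)c\left((7/2)e + \frac{7e}{d}\ln\left(\frac{18}{\delta}\right)\right)\right) + \mathrm{Log}\left(\frac{18}{\delta}\right)\right) \tag{8} \\
&\leq \frac{20}{7\ln(2)\lfloor m/4 \rfloor}\left(d\ln((9/5)ec) + 8\ln\left(\frac{18}{\delta}\right)\right),
\end{align*}
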

where this last inequality is due to Lemma 5 in Appendix A. Since
$m > c\ln(18e) - 1 > 3200$, we have
$\lfloor m/4 \rfloor > (m-4)/4 > \frac{799}{800}\frac{m}{4} > \frac{799}{800}\frac{3200}{3201}\frac{m+1}{4}$.
Plugging this relaxation into the above bound, combined with numerical calculation of the
logarithmic factor (with $c$ as defined above), we find that the expression in (8) is less than

$$\frac{150}{m+1}\left(d + \ln\left(\frac{18}{\delta}\right)\right).$$

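Additionally, if
$\mathcal{P}(\mathrm{ER}(h_i)) < \frac{23}{\lfloor m/4 \rfloor}\ln\left(\frac{9}{\delta}\right)$,
then monotonicity of measures implies

$$\mathcal{P}(\mathrm{ER}(h_i) \cap \mathrm{ER}(h)) \leq \mathcal{P}(\mathrm{ER}(h_i)) < \frac{23}{\lfloor m/4 \rfloor}\ln\left(\frac{9}{\delta}\right) < \frac{150}{m+1}\left(d + \ln\left(\frac{18}{\delta}\right)\right),$$

again using the above lower bound on $\lfloor m/4 \rfloor$ for this last inequality. Thus,
regardless of the value of $\mathcal{P}(\mathrm{ER}(h_i))$, on the event
$E_i \cap E_i' \cap E_i''$, we have
$\forall h \in \bigcup_{j \in \{1,2,3\}\setminus\{i\}} L(\mathbb{A}(S_0;T_j))$,

$$\mathcal{P}(\mathrm{ER}(h_i) \cap \mathrm{ER}(h)) < \frac{150}{m+1}\left(d + \ln\left(\frac{18}{\delta}\right)\right).$$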

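Now denote $h_{\mathrm{maj}} = \hat{h}_{m,T} = \mathrm{Majority}(L(\mathbb{A}(S;T)))$, again
with $S = \mathbb{S}_{1:m}$. By definition of $\mathrm{Majority}(\cdot)$, for any
$x \in \mathcal{X}$, at least 1/2 of the classifiers $h$ in the sequence
$L(\mathbb{A}(S;T))$ have $h(x) = h_{\mathrm{maj}}(x)$. From this fact, the strong form of the
pigeonhole principle implies that at least one $i \in \{1,2,3\}$ has
$h_i(x) = h_{\mathrm{maj}}(x)$ (i.e., the majority vote must agree with the majority of
classifiers in at least one of the three subsequences of classifiers). Furthermore, since each
$\mathbb{A}(S_0;T_j)$ (with $j \in \{1,2,3\}$) supplies an equal number of entries to the
sequence $\mathbb{A}(S;T)$ (by a straightforward inductive argument), for each
$i \in \{1,2,3\}$, at least 1/4 of the classifiers $h$ in
$\bigcup_{j \in \{1,2,3\}\setminus\{i\}} L(\mathbb{A}(S_0;T_j))$ have
$h(x) = h_{\mathrm{maj}}(x)$: that is, since
$|L(\mathbb{A}(S_0;T_i))| = (1/3)|L(\mathbb{A}(S;T))|$, we must have at least
$(1/6)|L(\mathbb{A}(S;T))| = (1/4)\left|\bigcup_{j \in \{1,2,3\}\setminus\{i\}} L(\mathbb{A}(S_0;T_j))\right|$
classifiers $h$ in $\bigcup_{j \in \{1,2,3\}\setminus\{i\}} L(\mathbb{A}(S_0;T_j))$ with
$h(x) = h_{\mathrm{maj}}(x)$ in order to meet the total of at least
$(1/2)|L(\mathbb{A}(S;T))|$ classifiers $h \in L(\mathbb{A}(S;T))$ with
$h(x) = h_{\mathrm{maj}}(x)$. In particular, letting $I$ be a random variable uniformly
distributed on $\{1,2,3\}$ (independent of the data), and letting $\tilde{h}$ be a random
variable conditionally (given $I$ and $S$) uniformly distributed on the classifiers
$\bigcup_{j \in \{1,2,3\}\setminus\{I\}} L(\mathbb{A}(S_0;T_j))$, this implies that for any
fixed $x \in \mathrm{ER}(h_{\mathrm{maj}})$, with conditional (given $S$) probability at least
1/12, $h_I(x) = \tilde{h}(x) = h_{\mathrm{maj}}(x)$, so that
$x \in \mathrm{ER}(h_I) \cap \mathrm{ER}(\tilde{h})$ as well. Thus, for a random variable
$X \sim \mathcal{P}$ (independent of the data and $I$, $\tilde{h}$), the law of total
probability and monotonicity of conditional expectations imply

\begin{align*}
\mathbb{E}\left[\mathcal{P}\left(\mathrm{ER}(h_I) \cap \mathrm{ER}(\tilde{h})\right) \,\middle|\, S\right]
&= \mathbb{E}\left[\mathbb{P}\left(X \in \mathrm{ER}(h_I) \cap \mathrm{ER}(\tilde{h}) \,\middle|\, I, \tilde{h}, S\right) \,\middle|\, S\right] \\
&= \mathbb{E}\left[\mathbb{1}\left[X \in \mathrm{ER}(h_I) \cap \mathrm{ER}(\tilde{h})\right] \,\middle|\, S\right]
 = \mathbb{E}\left[\mathbb{P}\left(X \in \mathrm{ER}(h_I) \cap \mathrm{ER}(\tilde{h}) \,\middle|\, S, X\right) \,\middle|\, S\right] \\
&\geq \mathbb{E}\left[\mathbb{P}\left(X \in \mathrm{ER}(h_I) \cap \mathrm{ER}(\tilde{h}) \,\middle|\, S, X\right)\mathbb{1}\left[X \in \mathrm{ER}(h_{\mathrm{maj}})\right] \,\middle|\, S\right] \\
&\geq \mathbb{E}\left[(1/12)\,\mathbb{1}\left[X \in \mathrm{ER}(h_{\mathrm{maj}})\right] \,\middle|\, S\right] = (1/12)\,\mathrm{er}_{\mathcal{P}}(h_{\mathrm{maj}};f^\star).
\end{align*}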


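Thus, on the event $\bigcap_{i \in \{1,2,3\}} E_i \cap E_i' \cap E_i''$,

\begin{align*}
\mathrm{er}_{\mathcal{P}}(h_{\mathrm{maj}};f^\star)
&\leq 12\,\mathbb{E}\left[\mathcal{P}\left(\mathrm{ER}(h_I) \cap \mathrm{ER}(\tilde{h})\right) \,\middle|\, S\right] \\
&\leq 12 \max_{i \in \{1,2,3\}} \max_{j \in \{1,2,3\}\setminus\{i\}} \max_{h \in L(\mathbb{A}(S_0;T_j))} \mathcal{P}(\mathrm{ER}(h_i) \cap \mathrm{ER}(h)) \\
&< \frac{1800}{m+1}\left(d + \ln\left(\frac{18}{\delta}\right)\right) = \frac{c}{m+1}\left(d + \ln\left(\frac{18}{\delta}\right)\right).
\end{align*}

Furthermore, by the union bound, the event
$\bigcap_{i \in \{1,2,3\}} E_i \cap E_i' \cap E_i''$ has probability at least $1-\delta$.
Thus, since this argument holds for any $\delta \in (0,1)$ and any finite sequence $T$ with
$f^\star \in \mathbb{C}[T]$, we have succeeded in extending the inductive hypothesis to
include $m' = m$.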

By the principle of induction, we have established the claim that, Vm G N, Vh G (0,1), for 
every finite sequence T of points in X x T with /* G C[T], with probability at least 1 — h, 

eTr{hm,T; /*) < (d + In ^ ^ . (9) 


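To complete the proof, we simply take $T = \emptyset$ (the empty sequence), and note that, for any
$\varepsilon, \delta \in (0,1)$, for any value $m \in \mathbb{N}$ of size at least

\[
\left\lfloor \frac{c}{\varepsilon}\left( d + \ln\left( \frac{18}{\delta} \right) \right) \right\rfloor, \tag{10}
\]

the right hand side of (9) is less than $\varepsilon$, so that $\mathrm{Majority}(\mathbb{L}(\mathbb{A}(\cdot\,; \emptyset)))$ achieves a sample
complexity equal to the expression in (10). In particular, this implies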


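\[
\mathcal{M}(\varepsilon, \delta) \leq \frac{c}{\varepsilon}\left( d + \ln\left( \frac{18}{\delta} \right) \right) = O\left( \frac{1}{\varepsilon}\left( d + \mathrm{Log}\left( \frac{1}{\delta} \right) \right) \right).
\]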
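To give a feel for the size of this bound, the following minimal sketch evaluates it numerically. It assumes $c = 1800$ (the constant appearing in the bound at the end of the inductive step) and uses the floor of the bound as the sample size. The function names are hypothetical and chosen only for illustration.

```python
import math

C = 1800  # assumed value of the constant c, taken from the inductive step


def sufficient_sample_size(eps, delta, d, c=C):
    """Floor of (c / eps) * (d + ln(18 / delta)): a sample size that suffices."""
    return math.floor((c / eps) * (d + math.log(18 / delta)))


def error_bound(m, delta, d, c=C):
    """Error-rate bound (c / (m + 1)) * (d + ln(18 / delta)) for m samples."""
    return (c / (m + 1)) * (d + math.log(18 / delta))


if __name__ == "__main__":
    eps, delta, d = 0.1, 0.05, 10
    m = sufficient_sample_size(eps, delta, d)
    # With m at least the floor above, the error bound drops below eps.
    print(m, error_bound(m, delta, d) < eps)
```

For VC dimension 10, $\varepsilon = 0.1$, and $\delta = 0.05$, this gives a sample size in the hundreds of thousands, which illustrates how large the constant factor is.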


5. Remarks 

On the issue of computational complexity, we note that the construction of subsamples by
$\mathbb{A}$ can be quite efficient. Since the branching factor is 3, while $|S_0|$ is reduced by roughly
a factor of 4 with each recursive call, the total number of subsamples returned by $\mathbb{A}(S;\emptyset)$
is a sublinear function of $|S|$ (roughly $|S|^{\log_4 3} \approx |S|^{0.79}$). Furthermore, with appropriate
data structures, the operations within each node of the recursion tree can be performed in
constant time. Indeed, as discussed above, one can directly determine which data points to
include in each subsample via a simple function of the indices, so that construction of these
subsamples truly is computationally easy.
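The counting argument above can be checked with a short sketch of the recursion. This is an illustration under stated assumptions, not a verbatim copy of the algorithm: it assumes the base case $|S| \leq 3$, takes $S_0$ to be the first $|S| - 3\lfloor |S|/4 \rfloor$ points, and takes $S_1, S_2, S_3$ to be the next three blocks of $\lfloor |S|/4 \rfloor$ points each. The function names are hypothetical.

```python
def subsamples(S, T=()):
    """Sketch of the recursive subsample construction A(S; T).

    Assumed split: S0 is the first |S| - 3*floor(|S|/4) points, and
    S1, S2, S3 are the following three blocks of floor(|S|/4) points.
    Each recursive call keeps S0 and passes two of the three blocks down.
    """
    S, T = tuple(S), tuple(T)
    if len(S) <= 3:  # assumed base case
        return [S + T]
    q = len(S) // 4
    n0 = len(S) - 3 * q
    S0 = S[:n0]
    S1 = S[n0:n0 + q]
    S2 = S[n0 + q:n0 + 2 * q]
    S3 = S[n0 + 2 * q:]
    return (subsamples(S0, S2 + S3 + T)
            + subsamples(S0, S1 + S3 + T)
            + subsamples(S0, S1 + S2 + T))


def num_subsamples(m):
    """Number of subsamples for |S| = m, computed without building them."""
    if m <= 3:
        return 1
    return 3 * num_subsamples(m - 3 * (m // 4))


if __name__ == "__main__":
    for m in (4, 16, 64, 256, 1024):
        # For m = 4**k the count is 3**k = m**(log 3 / log 4), sublinear in m.
        print(m, num_subsamples(m))
```

When $|S|$ is a power of 4, say $4^k$, the recursion has depth $k$ and returns $3^k$ subsamples, which is where the exponent $\log_4 3$ comes from.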


The only remaining significant computational issue in the learning algorithm is then the
efficiency of the sample-consistent base learner $\mathbb{L}$. The existence of an efficient such
algorithm $\mathbb{L}$ has been the subject of much investigation for a variety of concept spaces $\mathbb{C}$
(e.g., Khachiyan, 1979; Karmarkar, 1984; Valiant, 1984; Pitt and Valiant, 1988; Helmbold,
Sloan, and Warmuth, 1990). For instance, the commonly-used concept space of linear
separators admits such an algorithm (where $\mathbb{L}(S)$ may be expressed as a solution of a
system of linear inequalities). One can easily extend Theorem 2 to admit base learners $\mathbb{L}$
that are improper (i.e., which may return classifiers not contained in $\mathbb{C}$), as long as they
are guaranteed to return a sample-consistent classifier in some hypothesis space $\mathcal{H}$ of VC
dimension $O(d)$. Furthermore, as discussed by Pitt and Valiant (1988) and Haussler,
Kearns, Littlestone, and Warmuth (1991), there is a simple technique for efficiently
converting any efficient PAC learning algorithm for $\mathbb{C}$, returning classifiers in $\mathcal{H}$, into an
efficient algorithm $\mathbb{L}$ for finding (with probability $1 - \delta'$) a classifier in $\mathcal{H}$ consistent
with a given data set $S$ with $\mathbb{C}[S] \neq \emptyset$. Additionally, though the analysis above takes $\mathbb{L}$ to
be deterministic, this merely serves to simplify the notation in the proof, and it is
straightforward to generalize the proof to allow randomized base learners $\mathbb{L}$, including
those that fail to return a sample-consistent classifier with some probability $\delta'$ taken
sufficiently small (e.g., $\delta' = \delta/(2|\mathbb{A}(S;\emptyset)|)$). Composing these facts, we may conclude
that, for any concept space $\mathbb{C}$ that is efficiently PAC learnable using a hypothesis space $\mathcal{H}$
of VC dimension $O(d)$, there exists an efficient PAC learning algorithm for $\mathbb{C}$ with optimal
sample complexity (up to numerical constant factors).
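To make the notion of a sample-consistent base learner concrete for linear separators, here is a minimal sketch. It uses the perceptron update rather than the linear-programming methods cited above, so it illustrates the consistency requirement only and carries no polynomial running-time guarantee. The function names, the $\pm 1$ label encoding, and the `max_epochs` cap are assumptions made for illustration.

```python
def consistent_linear_separator(S, max_epochs=10_000):
    """Return (w, b) with y * (w . x + b) > 0 for every (x, y) in S.

    Assumes labels y in {-1, +1} and linearly separable data. This is the
    perceptron rule, swapped in for an LP solver purely for brevity.
    """
    dim = len(S[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in S:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin <= 0:  # point is misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: consistent with S
            return w, b
    raise ValueError("no consistent separator found within max_epochs")


def is_consistent(w, b, S):
    """Check the strict linear inequalities defining sample-consistency."""
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0
               for x, y in S)


if __name__ == "__main__":
    S = [((2.0, 1.0), 1), ((1.0, 3.0), 1),
         ((-1.0, -1.0), -1), ((-2.0, 0.5), -1)]
    w, b = consistent_linear_separator(S)
    print(is_consistent(w, b, S))
```

The check in `is_consistent` is exactly a system of strict linear inequalities in $(w, b)$, which is why a linear-programming solver can serve as the base learner for this concept space.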


We conclude by noting that the constant factors obtained in the above proof are quite
large. Some small refinements are possible within the current approach: for instance, by
choosing the $S_i$ subsequences slightly larger (e.g., $(3/10)|S|$), or using a tighter form of the
Chernoff bound when lower-bounding $N_i$. However, there are inherent limitations to the
approach used here, so that reducing the constant factors by more than, say, one order of
magnitude, may require significant changes to some part of the analysis, and perhaps the
algorithm itself. For this reason, it seems the next step in the study of $\mathcal{M}(\varepsilon, \delta)$ should be
to search for strategies yielding refined constant factors. In particular, Warmuth (2004)
has conjectured that the one-inclusion graph prediction algorithm also achieves a sample
complexity of the optimal form. This conjecture remains open at this time. The
one-inclusion graph predictor is known to achieve the optimal sample complexity in the
closely-related prediction model of learning (where the objective is to achieve expected
error rate at most $\varepsilon$), with a numerical constant factor very close to optimal (Haussler,
Littlestone, and Warmuth, 1994). It therefore seems likely that a (positive) resolution of
Warmuth's one-inclusion graph conjecture may also lead to improvements in constant
factors compared to the bound on $\mathcal{M}(\varepsilon, \delta)$ established in the present work.
Appendix A. A Technical Lemma

The following basic lemma is useful in the proof of Theorem 2.


Lemma 5 For any $a, b, c_1 \in [1, \infty)$ and $c_2 \in [0, \infty)$,

\[
a \ln\left( c_1 \left( c_2 + \frac{b}{a} \right) \right) \leq a \ln\left( c_1 (c_2 + e) \right) + \frac{1}{e} b.
\]


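Proof If $\frac{b}{a} \leq e$, then monotonicity of $\ln(\cdot)$ implies

\[
a \ln\left( c_1 \left( c_2 + \frac{b}{a} \right) \right) \leq a \ln\left( c_1 (c_2 + e) \right) \leq a \ln\left( c_1 (c_2 + e) \right) + \frac{1}{e} b.
\]

On the other hand, if $\frac{b}{a} > e$, then

\[
a \ln\left( c_1 \left( c_2 + \frac{b}{a} \right) \right) \leq a \ln\left( c_1 \max\{c_2, 2\} \frac{b}{a} \right) = a \ln\left( c_1 \max\{c_2, 2\} \right) + a \ln\left( \frac{b}{a} \right).
\]

The first term in the rightmost expression is at most $a \ln(c_1(c_2 + 2)) \leq a \ln(c_1(c_2 + e))$. The
second term in the rightmost expression can be rewritten as $b \frac{\ln(b/a)}{b/a}$. Since $x \mapsto \ln(x)/x$ is
nonincreasing on $(e, \infty)$, in the case $\frac{b}{a} > e$, this is at most $\frac{1}{e} b$. Together, we have that

\[
a \ln\left( c_1 \left( c_2 + \frac{b}{a} \right) \right) \leq a \ln\left( c_1 (c_2 + e) \right) + \frac{1}{e} b
\]

in this case as well.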
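As a sanity check on the statement of Lemma 5 (not a substitute for the proof), the following sketch evaluates both sides of the inequality on a small grid of admissible parameter values. The grid and the function name are arbitrary choices made for illustration.

```python
import math
from itertools import product


def lemma5_gap(a, b, c1, c2):
    """Right-hand side minus left-hand side of Lemma 5; should be >= 0."""
    lhs = a * math.log(c1 * (c2 + b / a))
    rhs = a * math.log(c1 * (c2 + math.e)) + b / math.e
    return rhs - lhs


# a, b, c1 range over [1, inf); c2 ranges over [0, inf).
GRID_ABC1 = [1.0, 1.5, 2.0, math.e, 5.0, 10.0, 100.0, 1e4]
GRID_C2 = [0.0, 0.5, 1.0, 2.0, 3.0, 10.0, 1e3]

worst_gap = min(
    lemma5_gap(a, b, c1, c2)
    for a, b, c1 in product(GRID_ABC1, repeat=3)
    for c2 in GRID_C2
)

if __name__ == "__main__":
    # A nonnegative value means the inequality held at every grid point.
    print(worst_gap >= 0)
```

The grid includes points on both sides of the case split $b/a \leq e$ versus $b/a > e$ used in the proof.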